I was one of the fortunate few, relatively speaking, to be physically present at Intel's launch of their amazing Lunar Lake processor at IFA Berlin, September 2-5. I even got the chance to speak a bit about our team's extremely positive experience with Lunar Lake, based on early engineering samples, and to elaborate on some of the specific Intel system optimizations our team is working on.
Introducing CANVID AI Features
In the demo area, Intel demonstrated a preview of some awesome new and upcoming CANVID AI features aimed at boosting productivity and making CANVID productions stand out even more ;).
Camera AI Background (Available Now) - Say goodbye to cluttered backgrounds! Remove, replace or blur your background at the click of a button. Camera AI Background is now available to everyone!
Synthetic AI Webcam (Private Beta) - Just wear your undies and focus on the screen recording; you don't need to record your webcam at the same time. You can generate a synthetic, lip-synced webcam anytime, based on another recording where you looked your best ;)
Generative AI Retakes (Internal Beta) - Manual video editing is old school! When you're done recording, select any paragraph of the auto-generated transcript you want to change. Re-speak the corresponding audio, and your webcam footage will automatically regenerate to match the new audio (not demoed)
A Behind-the-Scenes Look
Throughout the rest of this semi-technical article, I'll give you a behind-the-scenes look at how the new features work, focusing mainly on "Camera AI Background". I will specifically look at performance data for Intel Core™ Ultra (Lunar Lake and Meteor Lake) and at some of the optimizations we have implemented, which are also aimed at boosting performance on older hardware. But before I dive into any technical jargon, let me just reiterate:
Camera AI Background is available to all users in CANVID's latest public update. Just toggle on "Camera AI Background" from the camera tab inside the editor; it's that simple! Click here to see the user guide.
Synthetic AI Webcam and Generative AI Retakes are still in private or internal beta, but if you're a licensed CANVID user or an influencer eager to test or promote these upcoming features, you can apply for early access here.
Camera AI Background
The implementation of "Camera AI Background" in CANVID totally rocks!! We leverage best-in-class talking-head AI video background segmentation, similar to the technology used in our sister company's products XSplit VCam and VCam.ai. But instead of performing AI inference (generating foreground alpha masks) and compositing (producing the final picture) in a single pass, we've separated this into two steps to optimize CANVID's playback and export performance.
Divide & Conquer is a Winning Strategy
When switching on "Camera AI Background", we start by segmenting the webcam foreground from the background, generating a foreground alpha mask for each individual camera frame. To make errors and inaccuracies less noticeable, we also blend the masks between neighboring frames.
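As a rough illustration (not our production code), one simple way to blend masks temporally is an exponential moving average across frames; the weighting below is just an assumed value for the sketch:

```python
import numpy as np

def blend_masks(prev_blended: np.ndarray, current: np.ndarray,
                weight: float = 0.3) -> np.ndarray:
    """Temporally smooth alpha masks with an exponential moving average.

    prev_blended, current: HxW uint8 masks (0 = background, 255 = foreground).
    weight: contribution of the previous blended mask (assumed value).
    """
    blended = (weight * prev_blended.astype(np.float32)
               + (1.0 - weight) * current.astype(np.float32))
    return blended.astype(np.uint8)
```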
Once the alpha masks are generated, we store them as compressed H.264 frames with timestamps matching the original H.264-compressed webcam recording. To compress the 8-bit alpha masks with the H.264 codec, we construct video frames with the alpha mask in the luma (Y) plane and fill the chroma planes with a neutral 128 value (or so I thought, because if that were entirely true, you would see a grayscale guy instead of a green martian ;-). It's slightly wasteful to encode data you don't need, but it's the price we decided to pay to store the masks as a "video track".
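As a minimal sketch of the packing step (the encoder integration itself is omitted, and frame dimensions are assumed even), building a planar YUV 4:2:0 buffer an H.264 encoder will accept could look like this:

```python
import numpy as np

def pack_mask_as_yuv420(mask: np.ndarray) -> bytes:
    """Build a planar YUV 4:2:0 buffer with the alpha mask in the luma
    (Y) plane and neutral 128 values in both chroma planes.

    mask: HxW uint8 alpha mask; H and W are assumed even so the 2x2
    chroma subsampling works out.
    """
    h, w = mask.shape
    y = mask                                            # luma carries the mask
    u = np.full((h // 2, w // 2), 128, dtype=np.uint8)  # neutral chroma
    v = np.full((h // 2, w // 2), 128, dtype=np.uint8)
    return y.tobytes() + u.tobytes() + v.tobytes()
```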
Now that we have the foreground alpha masks inside the CANVID project file, we have effectively removed the inference workload from playback and export. Using the already-generated masks, it's now easy to dynamically compose a new background, blur the existing background, or completely remove the background, allowing the talking head to blend seamlessly into the screen recording.
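Compositing with a pre-generated mask then boils down to a per-pixel alpha blend. Here's a minimal CPU sketch of the idea in NumPy, not our actual compositor:

```python
import numpy as np

def composite(camera: np.ndarray, mask: np.ndarray,
              background: np.ndarray) -> np.ndarray:
    """Blend the camera frame over a new background using the stored mask.

    camera, background: HxWx3 uint8 frames of the same size;
    mask: HxW uint8 alpha (0 = background, 255 = foreground).
    """
    alpha = mask.astype(np.float32)[..., None] / 255.0
    out = alpha * camera + (1.0 - alpha) * background
    return out.astype(np.uint8)
```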
The AI model inference workload to generate the alpha masks is performed on the most capable hardware component available, whether that's the CPU, GPU, or NPU. On any device equipped with a compatible NPU, the inference is always handled by the NPU.
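To illustrate the device-selection idea with an OpenVINO-style runtime (a natural fit for Intel hardware, though not necessarily what we ship), the priority logic could look like the sketch below; the model filename is hypothetical:

```python
from openvino import Core  # assumes a recent OpenVINO Python API

core = Core()
# Prefer the NPU when present, then the GPU, then fall back to the CPU.
for device in ("NPU", "GPU", "CPU"):
    if device in core.available_devices:
        compiled_model = core.compile_model("segmentation_model.xml", device)
        break
```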
On Intel Core™ Ultra (Meteor Lake), the inference workload takes just 7 milliseconds per frame, and on the NPU of the new Core™ Ultra Series 2 (Lunar Lake), the same workload completes in just over 2 milliseconds; in theory that's close to 500 frames per second of pure inference. That is blazing fast, but due to additional system processing workloads, such as copying, compressing, and storing, the full processing time only translates into about 6 to 10x real-time. This can probably still be improved a good bit, but it's not a bad start ;)
Why Wait If You Don't Have To?
If the Camera AI Background feature is frequently used, waiting a minute or more to generate alpha masks for a 10-minute recording is still not an optimal way to spend your time. So on Core™ Ultra, or on systems with a capable GPU (minimum Intel 11th Gen Xe graphics or equivalent), CANVID allows users to pre-generate the foreground alpha masks during the initial screen and camera recording, with virtually no impact on performance.
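One way to picture this (a sketch of the design idea, not our actual recorder) is a bounded queue feeding a background worker thread, so the capture path never blocks on inference; run_segmentation and write_mask_frame are hypothetical stand-ins for the inference and mask-encoding steps:

```python
import queue
import threading

frame_queue = queue.Queue(maxsize=8)  # bounded, so a slow worker can't balloon memory

def mask_worker() -> None:
    """Generate alpha masks off the capture thread while recording continues."""
    while True:
        item = frame_queue.get()
        if item is None:  # sentinel pushed when recording stops
            break
        timestamp, frame = item
        mask = run_segmentation(frame)      # hypothetical NPU/GPU inference call
        write_mask_frame(timestamp, mask)   # hypothetical H.264 mask-track writer

threading.Thread(target=mask_worker, daemon=True).start()
```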
These images from Task Manager illustrate resource usage for a combined 2K screen recording and SD camera recording with and without alpha mask pre-generation on an Intel Core™ Ultra laptop (Meteor Lake).
As shown, pre-generating the alpha masks during recording is resource-efficient, adding only minimal CPU usage. This additional CPU load primarily stems from extra GPU-GPU copy operations and other processing overhead. The diagram shows the extra processing steps (alpha mask generation) performed during recording.
While there is potential for further optimization, the current implementation is robust and has no noticeable impact on the recording experience on capable hardware.
Composing the Alpha Masks and Original Camera Frames
During export and playback, we focus solely on composing the alpha masks with the original camera frames. As mentioned, this two-step approach prevents export speed from being constrained by the combined workload of inference and compositing. If these processes were handled together, older hardware would struggle, but by separating them, CANVID runs efficiently across most systems.
When exporting our 2-minute-15-second demo project (Demo Project A) with and without the background removed, we observe the following export times on a Core™ Ultra laptop (Meteor Lake):
The export times for compositions with the camera remain consistent when accounting for a standard error of 500 milliseconds. This translates to an export speed of approximately 120 frames per second, which corresponds to 2x real-time export for 1080p at 60fps and 4x real-time export for 1080p at 30fps.
The camera is currently composited at the same resolution for any export resolution, so the 720p export speed entry in the table above just aims to show that "Camera AI Background" does not noticeably affect export speed even at higher export speeds.
Synthetic Webcam & Generative Retakes
Synthetic AI Webcam and Generative AI Retakes will roll out in Q4 2024. We hope these features will have a major impact on creativity and productivity.
Synthetic Webcam
The most basic use-case for the synthetic webcam is to allow users to generate a webcam in existing screen recordings, even when a webcam wasn't originally used.
While the accuracy of our AI model for face and lip generation is still being refined, inaccuracies are less noticeable since the camera in CANVID is currently limited to Picture-in-Picture (PiP) view.
On Intel Core™ Ultra, we can generate the synthetic webcam at speeds of 80+ fps, and on the newly launched Intel Core™ Ultra Series 2 we're hitting 120+ fps, equivalent to more than 4x real-time speed, which is pretty amazing!! We anticipate the final AI model may be slightly larger, but we have ample room for optimization, so expect comparable speeds when the feature is officially released in early Q4 2024. While we may also end up providing a cloud execution option, the numbers speak for themselves: if you have modern hardware, local execution will probably be faster once you factor in cloud round-trip latency.
Looking ahead, beyond the initial feature release, the synthetic webcam feature will likely become a cornerstone for creating multilingual recordings. In these recordings, the presenter's voice would be cloned to generate audio in multiple languages, and the talking head would be generated for each language using the synthetic webcam technology. Cool, right? Exciting times ahead!
Generative Retakes
Generative Retakes is a feature that allows users to modify one or more "paragraphs of audio" by selecting sections of the transcript. In version 1 of the feature, which we anticipate releasing in Q4 2024, users can revise the audio by re-speaking the paragraphs they wish to add or modify. CANVID will then generate new synthetic talking head webcam frames to match the updated or newly added audio segments.
In subsequent versions of this feature we aim to enable corrections or additions to be made exclusively by editing the video transcript, utilizing a voice clone AI model to generate the audio instead of re-speaking.
Stay tuned for updates to this article, where we'll include demos of this feature.